Skip to content

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #49333

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Nov 17, 2022

Conversation

nikitaved
Copy link
Contributor

@nikitaved nikitaved commented Oct 26, 2022


EDIT (by MarcoGorelli): this is something @nikitaved, @fdrocha, and I started working on together at a company retreat

Co-Authored-By: MarcoGorelli <>
Co-Authored-By: FDRocha <>
@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 28, 2022 07:43
@MarcoGorelli MarcoGorelli added the Datetime Datetime data dtype label Oct 28, 2022
Comment on lines 70 to 85
#define FORMAT_STARTSWITH(ch) \
if (exact) { \
if (!format_len || *format != ch) { \
goto parse_error; \
} \
++format; \
--format_len; \
} else { \
if (format_len > 0) { \
if (*format != ch) { \
goto parse_error; \
} \
++format; \
--format_len; \
} \
} \
Copy link
Contributor Author

@nikitaved nikitaved Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super minor nit: maybe this could be a bit easier to parse?

// Always error on character mismatch conditioned on non-exhausted format, or when format is exhausted in the exact case.
if ((format_len && *format != ch) || (exact && !format_len)) goto parse_error;
// Advance if format is not exhausted
if (format_len) {
  ++format;
  --format_len;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your suggestion! are you sure this is right though? if exact=False and we're at the end of the format string, won't *format != ch resolve to True and so go to the parse_error? At least, if I try this, then I get a few test failures, e.g.

(.venv) megorelli@penguin:~/pandas-dev$ cat f.py 
import pandas as pd

pd.to_datetime(['2022-01-01'], format='%Y-%m', exact=False)

(.venv) megorelli@penguin:~/pandas-dev$ python f.py 
Traceback (most recent call last):
  File "/home/megorelli/pandas-dev/f.py", line 3, in <module>
    pd.to_datetime(['2022-01-01'], format='%Y-%m', exact=False)
  File "/home/megorelli/pandas-dev/pandas/core/tools/datetimes.py", line 1130, in to_datetime
    result = convert_listlike(argc, format)
  File "/home/megorelli/pandas-dev/pandas/core/tools/datetimes.py", line 441, in _convert_listlike_datetimes
    result, tz_parsed = objects_to_datetime64ns(
  File "/home/megorelli/pandas-dev/pandas/core/arrays/datetimes.py", line 2205, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 440, in pandas._libs.tslib.array_to_datetime
    cpdef array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 622, in pandas._libs.tslib.array_to_datetime
    raise ValueError(
ValueError: time data "2022-01-01" at position 0 doesn't match format "%Y-%m"

Copy link
Contributor Author

@nikitaved nikitaved Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format exhaustion is not tested outside of the macro, right? Maybe iterating over the format string is better than over the original string in the external loop, especially for the case when exact=False.

Copy link
Contributor Author

@nikitaved nikitaved Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli , I have updated the check so that *format != ch is conditioned on format_len > 0. Would that work now? I would still argue that changing this whole function to iterate over format is a better approach, but I might be missing some use cases for which this is not true...

Copy link
Member

@MarcoGorelli MarcoGorelli Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's cleaner, thanks!

Regarding iterating over format - not sure that'd be possible, because there are parts of the codebase which reach here without passing format. An option could be to do something like:

  • if format == '' and not exact then use the current path,
  • else if format != '', then iterate over format until it's exhausted, then use the current codepath
    but I think that would add even more complexity (if I've understood correctly)

@@ -445,6 +445,8 @@ cpdef array_to_datetime(
bint utc=False,
bint require_iso8601=False,
bint allow_mixed=False,
str format="",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use None as a default rather than ""? I think can just type as Optional[str] which would be a bit cleaner

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, could do that here, but then wouldn't it still need to be str when it's passed to the C function in

format_buf = get_c_string_buf_and_size(format, &format_length)
return parse_iso_8601_datetime(buf, length, want_exc,
dts, out_bestunit, out_local, out_tzoffset,
format_buf, format_length, exact)

? If I understand correctly, at some point the conversion from None to "" would have to happen anyway

Copy link
Contributor Author

@nikitaved nikitaved Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If None can be interpreted as nullptr (can we just pass 0?), I agree that having None is a cleaner option for something that is user-facing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't user-facing, it's just used internally

Not sure I understand about None being interpreted as nullptr, but if I pass None to get_c_string_buf_and_size I get

Traceback (most recent call last):
  File "t.py", line 109, in <module>
    print(to_datetime('2012-01-01T10:00', format='%Y-%m-%dT%H:%M', exact=False))
  File "/home/marcogorelli/pandas-dev/pandas/core/tools/datetimes.py", line 1117, in to_datetime
    result = convert_listlike(np.array([arg]), format)[0]
  File "/home/marcogorelli/pandas-dev/pandas/core/tools/datetimes.py", line 441, in _convert_listlike_datetimes
    result, tz_parsed = objects_to_datetime64ns(
  File "/home/marcogorelli/pandas-dev/pandas/core/arrays/datetimes.py", line 2177, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 440, in pandas._libs.tslib.array_to_datetime
    cpdef array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 706, in pandas._libs.tslib.array_to_datetime
    raise
  File "pandas/_libs/tslib.pyx", line 606, in pandas._libs.tslib.array_to_datetime
    string_to_dts_failed = string_to_dts(
  File "pandas/_libs/tslibs/np_datetime.pyx", line 288, in pandas._libs.tslibs.np_datetime.string_to_dts
    format_buf = get_c_string_buf_and_size(None, &format_length)
  File "pandas/_libs/tslibs/util.pxd", line 221, in pandas._libs.tslibs.util.get_c_string_buf_and_size
    return PyUnicode_AsUTF8AndSize(py_string, length)
TypeError: bad argument type for built-in operation

So I think the conversion between None and str needs to happen at some point internally anyway

If exact if False and format is '', then it should match anything

Copy link
Contributor Author

@nikitaved nikitaved Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about?

if format is None:
  # Make format_buf a nullptr
  format_buf = 0
  format_lenght = 0
else:
  format_buf = get_c_string_buf_and_size(..., &format_length)
...

return parse_iso...

The rest of the code should work just fine given that the checks are conditioned on format_lenght.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yeah, looks good - I presume format_buf would have to be '\0' rather than 0 though?

Copy link
Contributor Author

@nikitaved nikitaved Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as format_len is 0, *format_buf could be anything, so it is a bit clearer to have the pointer set to NULL, imho.

Copy link
Member

@MarcoGorelli MarcoGorelli Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how though? because if it's 0, then we get

[1/1] Cythonizing pandas/_libs/tslibs/np_datetime.pyx

Error compiling Cython file:
------------------------------------------------------------
...

    buf = get_c_string_buf_and_size(val, &length)
    format_buf = get_c_string_buf_and_size(format, &format_length)
    return parse_iso_8601_datetime(buf, length, want_exc,
                                   dts, out_bestunit, out_local, out_tzoffset,
                                   0, 0, exact)
                                  ^
------------------------------------------------------------

pandas/_libs/tslibs/np_datetime.pyx:290:35: Cannot assign type 'long' to 'const char *'

And with None:

Error compiling Cython file:
------------------------------------------------------------
...

    buf = get_c_string_buf_and_size(val, &length)
    format_buf = get_c_string_buf_and_size(format, &format_length)
    return parse_iso_8601_datetime(buf, length, want_exc,
                                   dts, out_bestunit, out_local, out_tzoffset,
                                   None, 0, exact)
                                  ^
------------------------------------------------------------

pandas/_libs/tslibs/np_datetime.pyx:290:35: Cannot assign None to const char *

Copy link
Contributor Author

@nikitaved nikitaved Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, it is a cast issue. Does Cython support casts to C types? If so, something like format_buf = (const char*) 0 should do. Or, maybe, we can specify the type of format_buf in advance... Sorry, I am not an expert in Cython :)
Maybe this will work?

if format is None:
  format_buf: const char* = 0
  format_len = 0
else:
  format_buf = get_c_string_buf_and_size(val, &format_len)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried

format_buf = <const char*> 0

but then get

Segmentation fault

(with no further info)

For now I've gone with format = b'', and handling None inside this function does indeed look cleaner, thanks!

@@ -66,10 +66,24 @@ This file implements string parsing and creation for NumPy datetime.
*
* Returns 0 on success, -1 on failure.
*/

#define FORMAT_STARTSWITH(ch) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this a function instead of a macro?

Copy link
Contributor Author

@nikitaved nikitaved Oct 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The macro modifies two variables and has access to the goto location. A separate function then has to have at least two variables accepted with one of them being a pointer to a pointer. It also has to return a flag that needs a check every single time it is called. We can remove this pointer to a pointer, however, and modify just an offset variable, but checks are unavoidable. Is that a preferred approach? What do you think about checks being potentially redirected to the goto location in a macro? Or we could refactor the code and remove the goto once and for all, and handle the error message in a macro.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also just do one function that performs the check and in the caller run the goto in the failing branch while advancing the pointers in the succeeding branch. It is more verbose but also more explicit than the macro.

@WillAyd
Copy link
Member

WillAyd commented Oct 28, 2022

This is vendored from numpy - can you also check if there's something like this upstream? If not may be worth doing there as well

@MarcoGorelli MarcoGorelli changed the title ISO8601-exact support in to_datetime BUG: pandas.to_datetime() does not respect exact format string with ISO8601 Oct 30, 2022
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review - I've changed the macro to a function

Regarding the numpy function this is vendored from - that one's a lot stricter, it only allows - as a separator

https://github.com/numpy/numpy/blob/7653829fa0d409d5850566cda1c031d9fadc1c22/numpy/core/src/multiarray/datetime_strings.c#L402-L413

whereas pandas allows multiple separators (so long as they're consistent)

char valid_ymd_sep[] = {'-', '.', '/', '\\', ' '};

so I don't think this would be useful to them

Regarding whether the pandas function should become stricter - not sure, it's not user-facing anyway, it just gives users a fast-path for ISO8601-like date strings

@@ -445,6 +445,8 @@ cpdef array_to_datetime(
bint utc=False,
bint require_iso8601=False,
bint allow_mixed=False,
str format="",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, could do that here, but then wouldn't it still need to be str when it's passed to the C function in

format_buf = get_c_string_buf_and_size(format, &format_length)
return parse_iso_8601_datetime(buf, length, want_exc,
dts, out_bestunit, out_local, out_tzoffset,
format_buf, format_length, exact)

? If I understand correctly, at some point the conversion from None to "" would have to happen anyway

@@ -66,10 +66,25 @@ This file implements string parsing and creation for NumPy datetime.
*
* Returns 0 on success, -1 on failure.
*/

inline int format_startswith(char ch, int format_len, char format, int exact) {
Copy link
Member

@WillAyd WillAyd Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not sure about this function. What is the advantage of this over using strncmp? Seems like it would be cleaner to just use that plus some combination of exact outside of this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think I understand a bit more now what you are trying to do. I think it would be best if you just kept a local variable for characters consumed, which you can increment every time you increment the format pointer (you are kind of doing this anyway). You don't really need a function but can rather just do if (format++ != '%') { // handle error }

The order of operations will do the postfix increment after assignment, so this consolidates what you are trying to do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking I've understood, is the suggestion to get rid of the function and each time do something like

if (((format_len > 0) && (format++ != '%')) || (exact && (format_len<=0))){
    if (format_len > 0) format_len--;
    goto parse_error;
}

?

I presume I've misunderstood because to me this looks more complicated and loses the docstring 😄

Copy link
Contributor Author

@nikitaved nikitaved Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to even do everything in one single if statement if ((exact && !format_len) || (format_len && format_len-- && *format++ != ch)) goto ... , but it then starts to look as one of these tricky C language questions... This explicit function has comments, it is more clear imho, and is very likely to get inlined.

Copy link
Member

@WillAyd WillAyd Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not a matter of inlining as much as code organization. For instance, this block:

    if (!format_startswith('%', format_len, *format, exact)) goto parse_error;
    if (format_len) {++format; --format_len;}
    if (!format_startswith('Y', format_len, *format, exact)) goto parse_error;
    if (format_len) {++format; --format_len;}

Would seem more naturally expressed as:

if (format_len < 2 && exact) {
  goto parse_error;
} 

if (*format++ != '%') {
  goto parse_error;
} 

if (*format++ != 'Y') {
  goto parse_error;
}

There's some code golf you can do therein to shorten it but that's not really what I'm after. Moreso we should refactor this to avoid trying to cram a lot into a function with side effects, as we don't use that pattern much if at all in our code base, and a lot of the character by character checks being done are overkill

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other advantage of this is you could provide an actual error message as to what went wrong at what position of the format, though not critical to scope for this PR

Copy link
Contributor Author

@nikitaved nikitaved Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Can't we just refactor everything and replace it with something like

  1. match given format with the full format and return position of mismatch/success that depends on lengths and exactness.
  2. Parse the string with sprintf or something like this, with additional validity checks. Provided that 1 is a success

?

Unless doing everything in one go is a performance decision...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you see a better way to refactor I am definitely open to it. I don't think the performance here will be that critical.
Probably better served as a pre cursor before adding functionality here, or alternately as a follow up to the "side-effect-less" design mentioned already

@@ -104,19 +119,28 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (!format_startswith(' ', format_len, *format, exact)) goto parse_error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but can you run clang-format against this file? I think it would format this differently

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making really nice progress here. Mostly a suggestion on branching layout that I think can be applied to all cases

@@ -104,19 +105,30 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (format_len < 1 && exact) goto parse_error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (format_len < 1 && exact) goto parse_error;
if (format_len < 1 && exact) {
goto parse_error;
}

I think clang-format will automatically enforce this style for you

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried clang-format and it formatted it without the braces, i.e.

        if (format_len < 1 && exact)
          goto parse_error;

so I've gone with that for now

}

if (sublen == 0) {
goto parse_error;
}

/* PARSE THE YEAR (4 digits) */
if (format_len < 2 && exact) goto parse_error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (format_len < 2 && exact) goto parse_error;
if (format_len < 2) {
if (exact) {
goto parse_error;
}
// What happens here?
} else {
if (*format++ != '%' && *format++ != 'Y') {
goto parse_error;
}
format_len -= 2;
}

Current code has undefined behavior and a possible segfault if format_len is 1 while exact is False. Need to think through what action should be taken there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice that shouldn't be possible - this function's only used internally, and

def format_is_iso(f: str) -> bint:
"""
Does format match the iso8601 set that can be handled by the C parser?
Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
but must be consistent. Leading 0s in dates and times are optional.
"""
iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S{micro_or_tz}'.format
excluded_formats = ['%Y%m%d', '%Y%m', '%Y']
for date_sep in [' ', '/', '\\', '-', '.', '']:
for time_sep in [' ', 'T']:
for micro_or_tz in ['', '%z', '%Z', '.%f', '.%f%z', '.%f%Z']:
if (iso_template(date_sep=date_sep,
time_sep=time_sep,
micro_or_tz=micro_or_tz,
).startswith(f) and f not in excluded_formats):
return True
return False

would return false if the format were for example %Y-%. I've changed it to raise if format_len isn't 0 anyway, thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea what you will find with C is that you want to be extremely explicit. Even if it's not possible today, if someone refactors the code and it does happen things become very difficult to debug.

Our error checking in some of our extensions is really loose, which is why our internal code is relatively difficult to refactor

@@ -104,19 +105,30 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (format_len < 1 && exact) goto parse_error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (format_len < 1 && exact) goto parse_error;
if (format_len < 1) {
if (exact) {
goto parse_error;
}
// Do something?
} else {
if (*format++ != ' ') {
goto parse_error;
}
format_len--;
}

I think this control flow would work more universally and avoid logic pitfalls / segfaults (see other comment)

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @WillAyd for your review!

}

if (sublen == 0) {
goto parse_error;
}

/* PARSE THE YEAR (4 digits) */
if (format_len < 2 && exact) goto parse_error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice that shouldn't be possible - this function's only used internally, and

def format_is_iso(f: str) -> bint:
"""
Does format match the iso8601 set that can be handled by the C parser?
Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
but must be consistent. Leading 0s in dates and times are optional.
"""
iso_template = '%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S{micro_or_tz}'.format
excluded_formats = ['%Y%m%d', '%Y%m', '%Y']
for date_sep in [' ', '/', '\\', '-', '.', '']:
for time_sep in [' ', 'T']:
for micro_or_tz in ['', '%z', '%Z', '.%f', '.%f%z', '.%f%Z']:
if (iso_template(date_sep=date_sep,
time_sep=time_sep,
micro_or_tz=micro_or_tz,
).startswith(f) and f not in excluded_formats):
return True
return False

would return false if the format were for example %Y-%. I've changed it to raise if format_len isn't 0 anyway, thanks

@@ -104,19 +105,30 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (format_len < 1 && exact) goto parse_error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried clang-format and it formatted it without the braces, i.e.

        if (format_len < 1 && exact)
          goto parse_error;

so I've gone with that for now

'--extensions=c,h',
'--headers=h',
--recursive,
'--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_iso_8601_datetime is now more than 500 non-comment lines, so I've turned off that check for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate to this but I would also be fine exploring another tool aside from cpplint. I haven't seen many other projects use this tool, particularly for C instead of C++. clang has a formating and static analyzer tool that would likely be more useful

@@ -104,19 +107,44 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (format_len < 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK looking great. Now I think we've established a pretty clear pattern for a function. So with inputs of:

  • format string
  • format length remaining
  • characters to compare
  • whether exactness matters or not

I think you can now convert this back into the reusable function you always wanted. Something like:

// This function will advance the pointer on format
// and decrement characters_remaining by n on success
// On failure will return -1 without incrementing
int compare_format(const char **format, int *characters_remaining, 
                 const char *compare_to, int n, const int exact) {
  if (*characters_remaining < n) {
    if (exact) {
      // TODO: in the future we should set a PyErr here 
      // to be very clear about what went wrong
      return -1;
    } else {
      // TODO: same return value in this function as
      // above branch, but stub out a future where
      // we have a better error message
      return -1;
    }
  }
  else {
    if (!strncmp(*format, compare_to, n)) {
      // TODO: PyErr to differentiate what went wrong
      return -1;
    } else {
      *format += n;
      *characters_remaining -= n;
      return 0;
    }
  }
}

Calling this like compare_format(&format, &format_len, '%y', 2, check) should do the mutation you are looking for and give us a better pattern to set explicit error messages about what went wrong. Ideally would do this along with this function, but I think in a follow up it would be better if we replaced the goto with a -1 return value + setting a PyError, as that is a common practice in CPython development

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Nov 13, 2022

Thanks @WillAyd for your review, I've learned a few new things!

I wasn't sure how to pass a character (like ymd_sep) to strncmp, so I did const char tmp[1] = {ymd_sep}, is that alright?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work lgtm @mroeschke

@mroeschke
Copy link
Member

Just a merge conflict otherwise very nice!

MarcoGorelli added 2 commits November 17, 2022 10:04
…er/format_iso

Co-authored-by: fdrocha <>
Co-authored-by: nikitaved <>
>
>
Co-authored-by: fdrocha <>
Co-authored-by: nikitaved <>
@MarcoGorelli MarcoGorelli added this to the 2.0 milestone Nov 17, 2022
@MarcoGorelli
Copy link
Member

Thanks Will and Matthew for your reviews/approvals!

Let's ship this then, and thanks @nikitaved and @fdrocha for the collaboration

@MarcoGorelli MarcoGorelli merged commit d2630d8 into pandas-dev:main Nov 17, 2022
@nikitaved
Copy link
Contributor Author

nikitaved commented Nov 17, 2022

Thank you very much, @MarcoGorelli, @fdrocha , @WillAyd , @mroeschke , for this amazing experience! I would really love to contribute more :)

MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this pull request Nov 18, 2022
…SO8601 (pandas-dev#49333)

* initial format support

Co-Authored-By: MarcoGorelli <>
Co-Authored-By: FDRocha <>

* set exact=False default in objects_to_datetime

* 🏷️ typing

* simplify

* replace macro with function

* clean up

* 📝 restore docstring

* inline

* set format default to None

* clean up

* remove function, perform check inline

* only compare *format++ if format_len

* clean up

* typing

* split out branches

* use compare_format function

* remove tmp variable

* Add co-authors
>
>
Co-authored-by: fdrocha <>
Co-authored-by: nikitaved <>

Co-authored-by: Marco Gorelli <>
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
…SO8601 (pandas-dev#49333)

* initial format support

Co-Authored-By: MarcoGorelli <>
Co-Authored-By: FDRocha <>

* set exact=False default in objects_to_datetime

* 🏷️ typing

* simplify

* replace macro with function

* clean up

* 📝 restore docstring

* inline

* set format default to None

* clean up

* remove function, perform check inline

* only compare *format++ if format_len

* clean up

* typing

* split out branches

* use compare_format function

* remove tmp variable

* Add co-authors
>
>
Co-authored-by: fdrocha <>
Co-authored-by: nikitaved <>

Co-authored-by: Marco Gorelli <>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas.to_datetime() does not respect exact format string with ISO8601
4 participants